BIOL5081: Biostatistics Intro

The first six weeks of the course will cover data science and fundamental exploratory data analysis in R. It is a brief introduction to contemporary open-science, best-practice data and biostatistical working tools. The primary goal is exploration of tools and approaches associated with effective, efficient, reproducible biostatistical analyses. Inspirations include Software Carpentry, rforcats, and many more open, free, online resources I can direct you to as needed. The description provided by the university is a useful starting point in deciding whether the Fall 2016 offering meets your specific needs.


Course outline

Lesson 1. Data science (DS).

want data. have data. will collect data. assumption: in this course, you are here to work with data. data literacy IS data science.

WhyR introductory lecture

The importance of data viz

Philosophy of R stats. Statistical thinking rests on likelihood, error, and effect sizes. Contemporary statisticians embrace and remain mindful of these three key concepts in all research, data handling, and statistical inference.

Modes of inference with your data: using data, what can you infer or do?
1. Data description
2. Likelihood
3. Estimation
4. Bayesian inference - weight and include what we know
5. Prediction
6. Hypothesis testing
7. Decision making - balance gains and risks

This set of ideas provides the foundation for many data science and statistical approaches to working with evidence in almost every domain of science.
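Several of these modes can be contrasted on a single sample. Below is a minimal base-R sketch on hypothetical data (not from the course) showing description, estimation, and hypothesis testing side by side:

```r
# Hypothetical measurements; set.seed makes the sketch reproducible
set.seed(42)
x <- rnorm(20, mean = 5, sd = 2)

# 1. Data description
mean(x)
sd(x)

# 3. Estimation: a 95% confidence interval for the mean
t.test(x)$conf.int

# 6. Hypothesis testing: is the mean different from 4?
t.test(x, mu = 4)$p.value
```

The same data, three different questions: what are the data like, what range of means is plausible, and is a specific hypothesized value credible.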

Data viz first and foremost. Case study #1.

#blog post by Fisseha Berhane
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(reshape2)

#Create four groups: setA, setB, setC and setD.
setA=select(anscombe, x=x1,y=y1)
setB=select(anscombe, x=x2,y=y2)
setC=select(anscombe, x=x3,y=y3)
setD=select(anscombe, x=x4,y=y4)

#Add a third column which can help us to identify the four groups.
setA$group ='SetA'
setB$group ='SetB'
setC$group ='SetC'
setD$group ='SetD'

#merge the four datasets
all_data=rbind(setA,setB,setC,setD)  # merging all the four data sets
all_data[c(1,13,23,43),]  # showing sample
##     x    y group
## 1  10 8.04  SetA
## 13  8 8.14  SetB
## 23 10 7.46  SetC
## 43  8 7.91  SetD
#compare summary stats
summary_stats =all_data%>%group_by(group)%>%summarize("mean x"=mean(x),
                                       "Sample variance x"=var(x),
                                       "mean y"=round(mean(y),2),
                                       "Sample variance y"=round(var(y),1),
                                       'Correlation between x and y '=round(cor(x,y),2)
                                      )
models = all_data %>% 
      group_by(group) %>%
      do(mod = lm(y ~ x, data = .)) %>%
      do(data.frame(var = names(coef(.$mod)),
                    coef = round(coef(.$mod),2),
                    group = .$group)) %>%
dcast(., group~var, value.var = "coef")
## Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character
summary_stats_and_linear_fit = cbind(summary_stats, data_frame("Linear regression" =
                                    paste0("y = ",models$"(Intercept)"," + ",models$x,"x")))

summary_stats_and_linear_fit
##   group mean x Sample variance x mean y Sample variance y
## 1  SetA      9                11    7.5               4.1
## 2  SetB      9                11    7.5               4.1
## 3  SetC      9                11    7.5               4.1
## 4  SetD      9                11    7.5               4.1
##   Correlation between x and y  Linear regression
## 1                         0.82      y = 3 + 0.5x
## 2                         0.82      y = 3 + 0.5x
## 3                         0.82      y = 3 + 0.5x
## 4                         0.82      y = 3 + 0.5x
#data viz instead as first step
ggplot(all_data, aes(x=x,y=y)) +geom_point(shape = 21, colour = "red", fill = "orange", size = 3)+
    ggtitle("Anscombe's data sets")+geom_smooth(method = "lm",se = FALSE,color='blue') + 
    facet_wrap(~group, scales="free")

Outcome: doing stats first and data viz later (as we were just tricked into) shows that descriptive estimates of data can be deceptive. Draw first, then interpret.
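The deception is easy to verify numerically. This quick base-R check (the anscombe data ship with R) shows the four sets share, to two decimals, the same correlation even though the plots differ radically:

```r
# Correlation of each Anscombe x,y pair
cors <- sapply(1:4, function(i) cor(anscombe[[paste0("x", i)]],
                                    anscombe[[paste0("y", i)]]))
round(cors, 2)  # all four pairs round to 0.82
```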

Survey data from class. Case study #2.

#load class survey responses from google poll completed in class
survey<-read.csv("data/5081.survey.1.csv")
str(survey) #check data match what we collected
## 'data.frame':    18 obs. of  4 variables:
##  $ r.experience : int  2 1 2 3 1 1 3 1 1 3 ...
##  $ discipline   : Factor w/ 4 levels "ecology","environmental science",..: 4 3 4 4 1 4 1 1 1 3 ...
##  $ research.data: Factor w/ 2 levels "qualitative",..: 2 2 2 2 2 2 2 1 2 2 ...
##  $ r.studio     : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 2 1 2 1 ...
#data viz
hist(survey$r.experience, xlab="experience in R (1 is none, 5 is extensive)", ylab="frequency", main="Likert Scale 1 to 5")

plot(survey$r.experience~survey$discipline, xlab="discipline", ylab="experience in R")

plot(survey$r.studio, ylab="R Studio")

plot(survey$research.data, ylab="Research data")

#observe patterns by checking plots

Observations from data viz: We have limited experience in R. Experience in R varies by research discipline. Half the respondents have tried R Studio. Most participants will be working with quantitative data in their own research projects.

#Now, try some simple summary statistics.
summary(survey)
##   r.experience                   discipline      research.data r.studio
##  Min.   :1.000   ecology              :6    qualitative : 1    No :9   
##  1st Qu.:1.000   environmental science:1    quantitative:17    Yes:9   
##  Median :1.000   genetics             :2                               
##  Mean   :1.667   physiology           :9                               
##  3rd Qu.:2.000                                                         
##  Max.   :3.000
#Data summary looks reasonable based on plots, mean R experience is < 2
t.test(survey$r.experience, mu=1) #t-test if mean is different from 1
## 
##  One Sample t-test
## 
## data:  survey$r.experience
## t = 3.3665, df = 17, p-value = 0.003664
## alternative hypothesis: true mean is not equal to 1
## 95 percent confidence interval:
##  1.248861 2.084472
## sample estimates:
## mean of x 
##  1.666667
t.test(survey$r.experience, mu=2) #t-test if mean is different from 2
## 
##  One Sample t-test
## 
## data:  survey$r.experience
## t = -1.6833, df = 17, p-value = 0.1106
## alternative hypothesis: true mean is not equal to 2
## 95 percent confidence interval:
##  1.248861 2.084472
## sample estimates:
## mean of x 
##  1.666667
#A one sample t-test confirms we have a bit of experience in R (the mean differs from 1 but not from 2).
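Likert responses are ordinal, so a rank-based test is a reasonable complement to the one-sample t-test. The sketch below re-enters the 18 r.experience responses from the class data and runs both; the Wilcoxon test is an addition for comparison, not part of the original analysis:

```r
# The 18 survey responses for r.experience, transcribed from the class data
r.exp <- c(2, 1, 2, 3, 1, 1, 3, 1, 1, 3, 3, 1, 1, 2, 1, 1, 1, 2)

t.test(r.exp, mu = 1)  # reproduces the t = 3.37 result above

# Rank-based complement for ordinal Likert data (exact = FALSE because the
# many ties preclude an exact p-value)
wilcox.test(r.exp, mu = 1, exact = FALSE)
```

Both tests agree here, which is reassuring for such a small ordinal sample.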

m1<-glm(r.experience~discipline, family = poisson, data = survey) #test for differences between disciplines in R experience
m1 #model summary
## 
## Call:  glm(formula = r.experience ~ discipline, family = poisson, data = survey)
## 
## Coefficients:
##                     (Intercept)  disciplineenvironmental science  
##                       5.108e-01                       -5.108e-01  
##              disciplinegenetics             disciplinephysiology  
##                       1.823e-01                       -6.678e-10  
## 
## Degrees of Freedom: 17 Total (i.e. Null);  14 Residual
## Null Deviance:       6.808 
## Residual Deviance: 6.371     AIC: 56.79
anova(m1, test="Chisq") #test whether R experience differs among disciplines
## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: r.experience
## 
## Terms added sequentially (first to last)
## 
## 
##            Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL                          17     6.8075         
## discipline  3  0.43692        14     6.3706   0.9325
#Too little evidence that R experience differs significantly between disciplines.
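With n = 18 and ordinal responses, a rank-based test is a common cross-check on the Poisson GLM. This sketch uses the survey values transcribed from the tables above; kruskal.test is an addition for comparison, not part of the original analysis:

```r
r.exp <- c(2, 1, 2, 3, 1, 1, 3, 1, 1, 3, 3, 1, 1, 2, 1, 1, 1, 2)
disc <- c("physiology", "genetics", "physiology", "physiology", "ecology",
          "physiology", "ecology", "ecology", "ecology", "genetics",
          "ecology", "ecology", "physiology", "physiology",
          "environmental science", "physiology", "physiology", "physiology")

# Kruskal-Wallis rank-sum test: does r.experience differ among disciplines?
kruskal.test(r.exp ~ as.factor(disc))
```

Like the deviance test, this finds no evidence of differences among disciplines.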

Practical skill outcomes of R stats useful for the competency test: Understand the difference between R and R Studio. Use scripts or R Markdown files to save all your work. Be prepared to share your code. Load data, clean data, visualize data, then and only then begin applying statistics. Most immediately: be able to use and work with dataframes, vectors, functions, and libraries in R.

Lesson 2. Workflows & Data Wrangling (WDW).

workflows reproduce. openly.

data wrangling more than half the battle.

Data wrangling slide deck

Philosophy of R stats. Tidy data make your life easier. Data structures should match intuition and common sense. Data should have logical structure: rows are observations, columns are variables. Tidy data also increase the likelihood that others can use your data, do better science, reuse science, and help you and your ideas survive and thrive. A workflow should also include the wrangling you did to get your data ready. If data are already very clean in a spreadsheet, they can easily become a literate, logical dataframe. Nonetheless, you should still use annotation within the introductory code to explain the meta-data of your data to some extent and what you did pre-R to get it tidy. The philosophy here is very similar to the forthcoming data viz lesson, with two dominant paradigms: base R functions, versus pipes (%>%) and the logic embodied within the libraries of the tidyverse. Generally, I prefer the tidyverse because it is more logical and much easier to remember. It also has some specific functions that address common errors in modern data structures with little code needed to fix them up.
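To make "rows are observations, columns are variables" concrete, here is a minimal base-R sketch on hypothetical data; tidyr's gather() achieves the same reshaping in the tidyverse:

```r
# Hypothetical untidy data: one column per treatment, one row per replicate
wide <- data.frame(control = c(4.1, 3.8, 4.4),
                   drought = c(2.2, 2.5, 1.9))

# Tidy version: each row is one observation of (treatment, response)
long <- data.frame(treatment = rep(names(wide), each = nrow(wide)),
                   response  = unlist(wide, use.names = FALSE))
long
```

The long form is what ggplot2 and the dplyr verbs below expect.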

Workflow template for r-script

meta-data ####
author: cjlortie
date: 2016
purpose:

set-up ####
rm(list=ls())
getwd()
setwd("~/r")

read & check data ####
data <- read.csv("filename.csv")
names(data)
dim(data)
str(data)
summary(data)

visualize ####

check assumptions ####

model data and test hypothesis ####

Data wrangling

Base R key concepts: aggregate, tapply, sapply, lapply, subsetting, as.factor, is.numeric, NA handling

tidyverse key concepts: pipes (%>%) are your best friend!

dplyr: filter for rows, select for columns, mutate for new variables, summarise for bringing many values together

Excellent list of wrangling tools

Cheatsheet

tidyr

Great wrangling webinar
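A minimal sketch (hypothetical data) chaining the four dplyr verbs with pipes:

```r
library(dplyr)

# Hypothetical plant measurements
plants <- data.frame(species = c("a", "a", "b", "b", "b"),
                     site    = c("x", "y", "x", "y", "y"),
                     height  = c(10, 12, 7, 8, 9))

plants %>%
  filter(site == "y") %>%                  # rows
  select(species, height) %>%              # columns
  mutate(height.cm = height * 10) %>%      # new variable
  group_by(species) %>%
  summarise(mean.height = mean(height.cm)) # many values to one
```

Each verb takes a dataframe and returns a dataframe, which is what makes the pipe chain work.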

We will cover two basic challenges you will certainly encounter. Missing data

#Missing data. Missing cells/observations in a measure can kick back an error in some functions; other functions silently drop them, which can bias results.
ttc <-read.csv("data/ttc.csv")
str(ttc)
## 'data.frame':    31 obs. of  32 variables:
##  $ FARE.MEDIA: Factor w/ 21 levels "     BLIND/WAR AMPS",..: 19 13 12 16 10 11 7 14 15 2 ...
##  $ X2015     : int  NA 110945 NA NA 13323 204509 48396 NA 8843 48873 ...
##  $ X2014     : int  NA 111157 NA NA 9862 214932 42855 NA 9361 49120 ...
##  $ X2013     : int  NA 112360 NA NA 8194 213982 38426 NA 9557 48623 ...
##  $ X2012     : int  NA 117962 NA NA 4399 205086 35019 NA 10185 46467 ...
##  $ X2011     : int  NA 124748 NA NA 1139 194928 32091 NA 9893 43795 ...
##  $ X2010     : int  NA 120366 1298 NA 0 203101 9200 NA 9237 43149 ...
##  $ X2009     : int  NA 114686 8807 NA 0 208172 NA NA 8738 41445 ...
##  $ X2008     : int  NA 94210 34445 NA NA 203313 NA NA 7517 39408 ...
##  $ X2007     : int  NA 69134 65398 NA NA 195001 NA NA 7126 36317 ...
##  $ X2006     : int  NA 75340 68546 NA NA 171314 NA NA 5413 38684 ...
##  $ X2005     : int  NA 82162 73151 NA NA 140594 NA NA 1296 47521 ...
##  $ X2004     : int  NA 80859 72952 NA NA 125836 NA NA NA 55172 ...
##  $ X2003     : int  NA 80330 71485 NA NA 119681 NA NA NA 51328 ...
##  $ X2002     : int  NA 82102 74578 NA NA 116805 NA 762 NA 51052 ...
##  $ X2001     : int  NA 83771 70930 NA NA 118176 NA 1118 NA 58400 ...
##  $ X2000     : int  NA 82218 66331 NA NA 112081 NA 1105 NA 61539 ...
##  $ X1999     : int  NA 83028 64109 NA NA 103447 NA 1430 NA 54835 ...
##  $ X1998     : int  NA 85303 66490 NA NA 98473 NA 1649 NA 49658 ...
##  $ X1997     : int  NA 86991 66177 NA NA 91521 NA 1592 NA 46209 ...
##  $ X1996     : int  NA 87857 67164 4329 NA 86549 NA 1652 NA 32642 ...
##  $ X1995     : int  NA 87775 70369 12525 NA 96803 NA 1976 NA 20930 ...
##  $ X1994     : int  NA 97877 62700 13265 NA 96907 NA 2111 NA 20708 ...
##  $ X1993     : int  NA 104016 57710 12894 NA 100607 NA 2514 NA 22131 ...
##  $ X1992     : int  NA 114064 53655 11201 NA 109509 NA 2924 NA 23696 ...
##  $ X1991     : int  NA 111365 35788 17927 NA 108148 NA 3235 NA 60034 ...
##  $ X1990     : int  NA 119538 38369 22313 NA 116610 NA 2758 NA 67296 ...
##  $ X1989     : int  NA 114874 37401 27025 NA 113506 NA NA NA 65665 ...
##  $ X1988     : int  NA 122180 39514 18837 NA 119264 NA NA NA 66872 ...
##  $ X1987     : int  NA 127088 38944 8976 NA 109151 NA NA NA 75308 ...
##  $ X1986     : int  NA 126217 42052 7347 NA 101901 NA NA NA 66475 ...
##  $ X1985     : int  NA 128207 48793 NA NA 94970 NA NA NA 63986 ...
#check for missing values
is.na(ttc)
##       FARE.MEDIA X2015 X2014 X2013 X2012 X2011 X2010 X2009 X2008 X2007
##  [1,]      FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [2,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [3,]      FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
##  [4,]      FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [5,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
##  [6,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [7,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
##  [8,]      FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [9,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [10,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [11,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12,]      FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [13,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [14,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [15,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [16,]      FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [17,]      FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [18,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [19,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [20,]      FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [21,]      FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [22,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23,]      FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [24,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [26,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [27,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [28,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [29,]      FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [30,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [31,]      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##       X2006 X2005 X2004 X2003 X2002 X2001 X2000 X1999 X1998 X1997 X1996
##  [1,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [4,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
##  [5,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [7,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [8,]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [9,] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [10,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [11,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [13,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [14,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [15,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [16,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [17,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [18,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [19,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [20,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [21,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [22,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [24,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [26,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [27,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [28,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [29,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [30,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [31,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##       X1995 X1994 X1993 X1992 X1991 X1990 X1989 X1988 X1987 X1986 X1985
##  [1,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
##  [5,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [7,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [8,] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [9,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [10,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [11,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [13,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [14,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [15,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [16,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
## [17,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [18,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [19,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [20,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [21,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [22,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [24,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [26,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [27,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [28,] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [29,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [30,] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [31,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
summary(ttc, na.rm=TRUE) #excludes NA
##             FARE.MEDIA     X2015            X2014            X2013       
##       CASH       : 3   Min.   :    10   Min.   :    12   Min.   :   401  
##       PRESTO     : 3   1st Qu.:  1076   1st Qu.:  6087   1st Qu.:  7118  
##       TICKETS    : 3   Median : 12170   Median : 10802   Median : 10850  
##    SUB-TOTAL     : 3   Mean   : 68974   Mean   : 75560   Mean   : 77843  
##       WEEKLY PASS: 2   3rd Qu.: 48634   3rd Qu.: 49120   3rd Qu.: 52732  
##      TWO-FARE    : 2   Max.   :534005   Max.   :534815   Max.   :525194  
##  (Other)         :15   NA's   :8        NA's   :10       NA's   :11      
##      X2012            X2011            X2010            X2009       
##  Min.   :   372   Min.   :   344   Min.   :     0   Min.   :     0  
##  1st Qu.:  5141   1st Qu.:  4840   1st Qu.:  2539   1st Qu.:  4747  
##  Median : 11224   Median : 10690   Median :  9237   Median :  9844  
##  Mean   : 76162   Mean   : 74148   Mean   : 67353   Mean   : 69782  
##  3rd Qu.: 51249   3rd Qu.: 49146   3rd Qu.: 43149   3rd Qu.: 46170  
##  Max.   :514007   Max.   :500219   Max.   :477357   Max.   :471233  
##  NA's   :11       NA's   :11       NA's   :10       NA's   :11      
##      X2008            X2007            X2006            X2005       
##  Min.   :   310   Min.   :   295   Min.   :    58   Min.   :    93  
##  1st Qu.:  5334   1st Qu.:  4752   1st Qu.:  3978   1st Qu.:  3261  
##  Median : 11035   Median : 10892   Median : 10120   Median :  9606  
##  Mean   : 72806   Mean   : 71736   Mean   : 65906   Mean   : 63984  
##  3rd Qu.: 49701   3rd Qu.: 62491   3rd Qu.: 61156   3rd Qu.: 63630  
##  Max.   :466700   Max.   :459769   Max.   :444544   Max.   :431220  
##  NA's   :12       NA's   :12       NA's   :11       NA's   :11      
##      X2004            X2003            X2002            X2001       
##  Min.   :     0   Min.   :     0   Min.   :     0   Min.   :     0  
##  1st Qu.:  4184   1st Qu.:  3805   1st Qu.:  3264   1st Qu.:  2800  
##  Median :  9940   Median : 10586   Median : 10404   Median : 10765  
##  Mean   : 65379   Mean   : 63412   Mean   : 61538   Mean   : 62471  
##  3rd Qu.: 66028   3rd Qu.: 65337   3rd Qu.: 64830   3rd Qu.: 65670  
##  Max.   :418099   Max.   :405412   Max.   :415539   Max.   :419993  
##  NA's   :12       NA's   :12       NA's   :11       NA's   :11      
##      X2000            X1999            X1998            X1997       
##  Min.   :     0   Min.   :     0   Min.   :     0   Min.   :     0  
##  1st Qu.:  2558   1st Qu.:  2370   1st Qu.:  2283   1st Qu.:  1834  
##  Median : 10593   Median :  9655   Median :  9530   Median :  9157  
##  Mean   : 61079   Mean   : 58365   Mean   : 57806   Mean   : 56543  
##  3rd Qu.: 64276   3rd Qu.: 62428   3rd Qu.: 64004   3rd Qu.: 65008  
##  Max.   :410558   Max.   :392593   Max.   :388689   Max.   :379883  
##  NA's   :11       NA's   :11       NA's   :11       NA's   :11      
##      X1996            X1995            X1994            X1993       
##  Min.   :     0   Min.   :     0   Min.   :     0   Min.   :     0  
##  1st Qu.:  1652   1st Qu.:  1976   1st Qu.:  2111   1st Qu.:  2507  
##  Median :  9098   Median : 11338   Median : 12192   Median : 12544  
##  Mean   : 52806   Mean   : 55031   Mean   : 55096   Mean   : 58668  
##  3rd Qu.: 67164   3rd Qu.: 70369   3rd Qu.: 62700   3rd Qu.: 61242  
##  Max.   :372430   Max.   :388152   Max.   :388252   Max.   :393485  
##  NA's   :10       NA's   :10       NA's   :10       NA's   :11      
##      X1992            X1991            X1990            X1989       
##  Min.   :     0   Min.   :     0   Min.   :     0   Min.   :     0  
##  1st Qu.:  1858   1st Qu.:  2904   1st Qu.:  2914   1st Qu.:  2388  
##  Median : 11385   Median : 14268   Median : 15488   Median : 15177  
##  Mean   : 60409   Mean   : 66712   Mean   : 72297   Mean   : 70940  
##  3rd Qu.: 60610   3rd Qu.: 64236   3rd Qu.: 70048   3rd Qu.: 69216  
##  Max.   :404251   Max.   :424167   Max.   :459234   Max.   :450726  
##  NA's   :11       NA's   :12       NA's   :12       NA's   :12      
##      X1988            X1987            X1986            X1985       
##  Min.   :     0   Min.   :     0   Min.   :     0   Min.   :     0  
##  1st Qu.:  2525   1st Qu.:  3608   1st Qu.:  3427   1st Qu.:  2519  
##  Median : 16369   Median : 14344   Median : 13996   Median : 15652  
##  Mean   : 72982   Mean   : 75921   Mean   : 73280   Mean   : 76044  
##  3rd Qu.: 71772   3rd Qu.: 76750   3rd Qu.: 74446   3rd Qu.: 76829  
##  Max.   :463475   Max.   :456884   Max.   :441012   Max.   :432160  
##  NA's   :12       NA's   :13       NA's   :13       NA's   :14
new.ttc <-na.omit(ttc) # returns without missing values
is.na(new.ttc) #check to see if it worked
##    FARE.MEDIA X2015 X2014 X2013 X2012 X2011 X2010 X2009 X2008 X2007 X2006
## 2       FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 6       FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 10      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 11      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 13      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 14      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 15      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 18      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 19      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 22      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 24      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 25      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 26      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 27      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 31      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##    X2005 X2004 X2003 X2002 X2001 X2000 X1999 X1998 X1997 X1996 X1995 X1994
## 2  FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 6  FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 10 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 11 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 13 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 14 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 15 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 18 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 19 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 22 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 24 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 25 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 26 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 27 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 31 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##    X1993 X1992 X1991 X1990 X1989 X1988 X1987 X1986 X1985
## 2  FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 6  FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 10 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 11 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 13 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 14 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 15 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 18 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 19 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 22 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 24 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 25 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 26 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 27 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 31 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#many other solutions but I use these two frequently
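Note that na.omit drops whole rows, which can discard a lot of data; per-calculation na.rm is the gentler habit. A toy sketch with hypothetical values:

```r
# Hypothetical ridership with missing values
riders <- data.frame(y2015 = c(NA, 110945, 13323),
                     y2014 = c(NA, 111157, 9862))

na.omit(riders)                  # drops any row containing an NA
colMeans(riders, na.rm = TRUE)   # keeps rows; skips NAs per column
mean(riders$y2015, na.rm = TRUE) # same idea for a single column
```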

Selecting part of a dataset

library(dplyr)
survey<-read.csv("data/5081.survey.1.csv")
str(survey)
## 'data.frame':    18 obs. of  4 variables:
##  $ r.experience : int  2 1 2 3 1 1 3 1 1 3 ...
##  $ discipline   : Factor w/ 4 levels "ecology","environmental science",..: 4 3 4 4 1 4 1 1 1 3 ...
##  $ research.data: Factor w/ 2 levels "qualitative",..: 2 2 2 2 2 2 2 1 2 2 ...
##  $ r.studio     : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 2 1 2 1 ...
#I want just a simple dataframe with experience by discipline

experience <- survey %>% select(discipline, r.experience)
experience
##               discipline r.experience
## 1             physiology            2
## 2               genetics            1
## 3             physiology            2
## 4             physiology            3
## 5                ecology            1
## 6             physiology            1
## 7                ecology            3
## 8                ecology            1
## 9                ecology            1
## 10              genetics            3
## 11               ecology            3
## 12               ecology            1
## 13            physiology            1
## 14            physiology            2
## 15 environmental science            1
## 16            physiology            1
## 17            physiology            1
## 18            physiology            2
#Now I just want to select the physiology folks
physiologists <- experience %>% filter(discipline == "physiology")
physiologists
##   discipline r.experience
## 1 physiology            2
## 2 physiology            2
## 3 physiology            3
## 4 physiology            1
## 5 physiology            1
## 6 physiology            2
## 7 physiology            1
## 8 physiology            1
## 9 physiology            2
#Selections often also include a summary by levels, or adding a new column with some calculation. Think about what you have likely done in Excel.

#use pipes to make a summary table
experience <-survey %>% group_by(discipline) %>% summarise(
  count = n(),
  exp = mean (r.experience)
)

#What if I just want to add a new column to my dataframe that is a sum or some other calculation?
ttc.5.years <- ttc %>% mutate(five.year.sum = X2015+X2014+X2013+X2012+X2011)
str(ttc.5.years)
## 'data.frame':    31 obs. of  33 variables:
##  $ FARE.MEDIA   : Factor w/ 21 levels "     BLIND/WAR AMPS",..: 19 13 12 16 10 11 7 14 15 2 ...
##  $ X2015        : int  NA 110945 NA NA 13323 204509 48396 NA 8843 48873 ...
##  $ X2014        : int  NA 111157 NA NA 9862 214932 42855 NA 9361 49120 ...
##  $ X2013        : int  NA 112360 NA NA 8194 213982 38426 NA 9557 48623 ...
##  $ X2012        : int  NA 117962 NA NA 4399 205086 35019 NA 10185 46467 ...
##  $ X2011        : int  NA 124748 NA NA 1139 194928 32091 NA 9893 43795 ...
##  $ X2010        : int  NA 120366 1298 NA 0 203101 9200 NA 9237 43149 ...
##  $ X2009        : int  NA 114686 8807 NA 0 208172 NA NA 8738 41445 ...
##  $ X2008        : int  NA 94210 34445 NA NA 203313 NA NA 7517 39408 ...
##  $ X2007        : int  NA 69134 65398 NA NA 195001 NA NA 7126 36317 ...
##  $ X2006        : int  NA 75340 68546 NA NA 171314 NA NA 5413 38684 ...
##  $ X2005        : int  NA 82162 73151 NA NA 140594 NA NA 1296 47521 ...
##  $ X2004        : int  NA 80859 72952 NA NA 125836 NA NA NA 55172 ...
##  $ X2003        : int  NA 80330 71485 NA NA 119681 NA NA NA 51328 ...
##  $ X2002        : int  NA 82102 74578 NA NA 116805 NA 762 NA 51052 ...
##  $ X2001        : int  NA 83771 70930 NA NA 118176 NA 1118 NA 58400 ...
##  $ X2000        : int  NA 82218 66331 NA NA 112081 NA 1105 NA 61539 ...
##  $ X1999        : int  NA 83028 64109 NA NA 103447 NA 1430 NA 54835 ...
##  $ X1998        : int  NA 85303 66490 NA NA 98473 NA 1649 NA 49658 ...
##  $ X1997        : int  NA 86991 66177 NA NA 91521 NA 1592 NA 46209 ...
##  $ X1996        : int  NA 87857 67164 4329 NA 86549 NA 1652 NA 32642 ...
##  $ X1995        : int  NA 87775 70369 12525 NA 96803 NA 1976 NA 20930 ...
##  $ X1994        : int  NA 97877 62700 13265 NA 96907 NA 2111 NA 20708 ...
##  $ X1993        : int  NA 104016 57710 12894 NA 100607 NA 2514 NA 22131 ...
##  $ X1992        : int  NA 114064 53655 11201 NA 109509 NA 2924 NA 23696 ...
##  $ X1991        : int  NA 111365 35788 17927 NA 108148 NA 3235 NA 60034 ...
##  $ X1990        : int  NA 119538 38369 22313 NA 116610 NA 2758 NA 67296 ...
##  $ X1989        : int  NA 114874 37401 27025 NA 113506 NA NA NA 65665 ...
##  $ X1988        : int  NA 122180 39514 18837 NA 119264 NA NA NA 66872 ...
##  $ X1987        : int  NA 127088 38944 8976 NA 109151 NA NA NA 75308 ...
##  $ X1986        : int  NA 126217 42052 7347 NA 101901 NA NA NA 66475 ...
##  $ X1985        : int  NA 128207 48793 NA NA 94970 NA NA NA 63986 ...
##  $ five.year.sum: int  NA 577172 NA NA 36917 1033437 196787 NA 47839 236878 ...
str(ttc)
## 'data.frame':    31 obs. of  32 variables:
##  $ FARE.MEDIA: Factor w/ 21 levels "     BLIND/WAR AMPS",..: 19 13 12 16 10 11 7 14 15 2 ...
##  $ X2015     : int  NA 110945 NA NA 13323 204509 48396 NA 8843 48873 ...
##  $ X2014     : int  NA 111157 NA NA 9862 214932 42855 NA 9361 49120 ...
##  $ X2013     : int  NA 112360 NA NA 8194 213982 38426 NA 9557 48623 ...
##  $ X2012     : int  NA 117962 NA NA 4399 205086 35019 NA 10185 46467 ...
##  $ X2011     : int  NA 124748 NA NA 1139 194928 32091 NA 9893 43795 ...
##  $ X2010     : int  NA 120366 1298 NA 0 203101 9200 NA 9237 43149 ...
##  $ X2009     : int  NA 114686 8807 NA 0 208172 NA NA 8738 41445 ...
##  $ X2008     : int  NA 94210 34445 NA NA 203313 NA NA 7517 39408 ...
##  $ X2007     : int  NA 69134 65398 NA NA 195001 NA NA 7126 36317 ...
##  $ X2006     : int  NA 75340 68546 NA NA 171314 NA NA 5413 38684 ...
##  $ X2005     : int  NA 82162 73151 NA NA 140594 NA NA 1296 47521 ...
##  $ X2004     : int  NA 80859 72952 NA NA 125836 NA NA NA 55172 ...
##  $ X2003     : int  NA 80330 71485 NA NA 119681 NA NA NA 51328 ...
##  $ X2002     : int  NA 82102 74578 NA NA 116805 NA 762 NA 51052 ...
##  $ X2001     : int  NA 83771 70930 NA NA 118176 NA 1118 NA 58400 ...
##  $ X2000     : int  NA 82218 66331 NA NA 112081 NA 1105 NA 61539 ...
##  $ X1999     : int  NA 83028 64109 NA NA 103447 NA 1430 NA 54835 ...
##  $ X1998     : int  NA 85303 66490 NA NA 98473 NA 1649 NA 49658 ...
##  $ X1997     : int  NA 86991 66177 NA NA 91521 NA 1592 NA 46209 ...
##  $ X1996     : int  NA 87857 67164 4329 NA 86549 NA 1652 NA 32642 ...
##  $ X1995     : int  NA 87775 70369 12525 NA 96803 NA 1976 NA 20930 ...
##  $ X1994     : int  NA 97877 62700 13265 NA 96907 NA 2111 NA 20708 ...
##  $ X1993     : int  NA 104016 57710 12894 NA 100607 NA 2514 NA 22131 ...
##  $ X1992     : int  NA 114064 53655 11201 NA 109509 NA 2924 NA 23696 ...
##  $ X1991     : int  NA 111365 35788 17927 NA 108148 NA 3235 NA 60034 ...
##  $ X1990     : int  NA 119538 38369 22313 NA 116610 NA 2758 NA 67296 ...
##  $ X1989     : int  NA 114874 37401 27025 NA 113506 NA NA NA 65665 ...
##  $ X1988     : int  NA 122180 39514 18837 NA 119264 NA NA NA 66872 ...
##  $ X1987     : int  NA 127088 38944 8976 NA 109151 NA NA NA 75308 ...
##  $ X1986     : int  NA 126217 42052 7347 NA 101901 NA NA NA 66475 ...
##  $ X1985     : int  NA 128207 48793 NA NA 94970 NA NA NA 63986 ...
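A side note on the NA behaviour visible in five.year.sum above: `+` propagates NA, so any fare type missing even one year gets an NA sum. A minimal sketch of an NA-tolerant alternative (toy values, not the real ttc data):

```r
library(dplyr)

# two hypothetical fare types; the first is complete, the second is missing a year
toy <- data.frame(X2015 = c(10, NA), X2014 = c(20, 5))

toy %>% mutate(sum.na = X2015 + X2014,                              # NA if any year is NA
               sum.rm = rowSums(cbind(X2015, X2014), na.rm = TRUE)) # treats NA as 0
```

Whether a missing year should count as 0 is a judgement call about the data, not a coding detail.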
#so we made a new column in a new object.
#the same pipe without assignment prints the result but leaves the original dataframe untouched
ttc %>% mutate(five.year.sum = X2015+X2014+X2013+X2012+X2011)
##                   FARE.MEDIA  X2015  X2014  X2013  X2012  X2011  X2010
## 1                      ADULT     NA     NA     NA     NA     NA     NA
## 2                     TOKENS 110945 111157 112360 117962 124748 120366
## 3                    TICKETS     NA     NA     NA     NA     NA   1298
## 4                   TWO-FARE     NA     NA     NA     NA     NA     NA
## 5                     PRESTO  13323   9862   8194   4399   1139      0
## 6       REGULAR MONTHLY PASS 204509 214932 213982 205086 194928 203101
## 7        POST-SECONDARY PASS  48396  42855  38426  35019  32091   9200
## 8               TWIN-GO PASS     NA     NA     NA     NA     NA     NA
## 9                WEEKLY PASS   8843   9361   9557  10185   9893   9237
## 10                      CASH  48873  49120  48623  46467  43795  43149
## 11                 SUB-TOTAL 434889 437287 431142 419118 406594 386351
## 12            SENIOR/STUDENT     NA     NA     NA     NA     NA     NA
## 13              MONTHLY PASS  25092  23064  20509  19769  18590  17169
## 14               WEEKLY PASS    672    515    540    624    702    814
## 15                   TICKETS  32595  33408  35472  37039  38299  38674
## 16                  TWO-FARE     NA     NA     NA     NA     NA     NA
## 17                    PRESTO    438     12     NA     NA     NA     NA
## 18                      CASH  12170  12037   8538   8164   7609   5856
## 19                 SUB-TOTAL  70967  69036  65059  65596  65200  62513
## 20                  CHILDREN     NA     NA     NA     NA     NA     NA
## 21                FREE RIDES  10939     NA     NA     NA     NA     NA
## 22                   TICKETS   1066   7097   7563   7929   8304   8287
## 23                    PRESTO     10     NA     NA     NA     NA     NA
## 24                      CASH    526   3705   2708   2589   2433   2539
## 25                 SUB-TOTAL  12541  10802  10271  10518  10737  10826
## 26           DAY/VIST./OTHER   8561  10033  11428  11929  10642  10605
## 27            BLIND/WAR AMPS   1086   1119   1109   1086   1060   1073
## 28           PREMIUM EXPRESS    490    451    401    372    344    322
## 29           POSTAL CARRIERS     NA     NA     NA     NA     NA     NA
## 30                  GTA PASS   5471   6087   5784   5388   5642   5667
## 31              SYSTEM TOTAL 534005 534815 525194 514007 500219 477357
##     X2009  X2008  X2007  X2006  X2005  X2004  X2003  X2002  X2001  X2000
## 1      NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 2  114686  94210  69134  75340  82162  80859  80330  82102  83771  82218
## 3    8807  34445  65398  68546  73151  72952  71485  74578  70930  66331
## 4      NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 5       0     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 6  208172 203313 195001 171314 140594 125836 119681 116805 118176 112081
## 7      NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 8      NA     NA     NA     NA     NA     NA     NA    762   1118   1105
## 9    8738   7517   7126   5413   1296     NA     NA     NA     NA     NA
## 10  41445  39408  36317  38684  47521  55172  51328  51052  58400  61539
## 11 381848 378893 372976 359297 344724 334819 322824 325299 332395 323274
## 12     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 13  15331  14864  14506  12931  11068   9940  10586  11123  12397  11785
## 14    874    780    686    372     93      0      0      0      0      0
## 15  38615  39097  40181  40808  42746  41562  41844  44018  44012  43885
## 16     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 17     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 18   5526   5253   4211   4581   6549   7602   6759   6440   7507   7921
## 19  60346  59994  59584  58692  60456  59104  59189  61581  63916  63591
## 20     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 21     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 22   8562   8782   8959   8879   8143   7573   7915   8869   9133   9401
## 23     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 24   2410   2253   1933   2168   3916   4514   4072   3938   3999   4203
## 25  10972  11035  10892  11047  12059  12087  11987  12807  13132  13604
## 26  10880   9961   9636   9194   7598   6191   5817   9685   4943   4837
## 27   1074   1092   1094   1025   1026   1065   1113   1275   1223   1138
## 28    313    310    295    259    260    278    280    282    339    330
## 29     NA     NA     NA     58    702    700    664    683    719    753
## 30   5800   5415   5292   4972   4395   3855   3538   3927   3326   3031
## 31 471233 466700 459769 444544 431220 418099 405412 415539 419993 410558
##     X1999  X1998  X1997  X1996  X1995  X1994  X1993  X1992  X1991  X1990
## 1      NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 2   83028  85303  86991  87857  87775  97877 104016 114064 111365 119538
## 3   64109  66490  66177  67164  70369  62700  57710  53655  35788  38369
## 4      NA     NA     NA   4329  12525  13265  12894  11201  17927  22313
## 5      NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 6  103447  98473  91521  86549  96803  96907 100607 109509 108148 116610
## 7      NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 8    1430   1649   1592   1652   1976   2111   2514   2924   3235   2758
## 9      NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 10  54835  49658  46209  32642  20930  20708  22131  23696  60034  67296
## 11 306849 301573 292490 280193 290378 293568 299872 315049 336497 366884
## 12     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 13  10124   9419   8647   9098  11028  10716  10508  10389   5850   5117
## 14      0      0      0      0      0      0      0      0      0      0
## 15  44263  46559  48306  52852  58515  56539  56625  57082  56559  61358
## 16     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 17     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 18   7481   7197   7665   8114   5202   4772   4703   3725   6030   6324
## 19  61868  63175  64618  70064  74745  72027  71836  71196  68439  72799
## 20     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 21     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 22   9186   9640   9667  10129  11338  12192  12193  11569  11722  12417
## 23     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 24   4209   4344   4313   3674   2877   2717   2486   1860   2546   3071
## 25  13395  13984  13980  13803  14215  14909  14679  13429  14268  15488
## 26   5576   5236   4611   4234   4833   4764   4228   1853   2572    972
## 27   1131   1137   1057   1057   1059   1119   1122   1127   1093   1098
## 28    301    307    310    322    425    437    417    257     NA     NA
## 29    790    783    902   1107   1122   1085   1331   1340   1298   1993
## 30   2683   2494   1915   1650   1375    343     NA     NA     NA     NA
## 31 392593 388689 379883 372430 388152 388252 393485 404251 424167 459234
##     X1989  X1988  X1987  X1986  X1985 five.year.sum
## 1      NA     NA     NA     NA     NA            NA
## 2  114874 122180 127088 126217 128207        577172
## 3   37401  39514  38944  42052  48793            NA
## 4   27025  18837   8976   7347     NA            NA
## 5      NA     NA     NA     NA     NA         36917
## 6  113506 119264 109151 101901  94970       1033437
## 7      NA     NA     NA     NA     NA        196787
## 8      NA     NA     NA     NA     NA            NA
## 9      NA     NA     NA     NA     NA         47839
## 10  65665  66872  75308  66475  63986        236878
## 11 358471 366667 359467 343992 335956       2129030
## 12     NA     NA     NA     NA     NA            NA
## 13   4570   4541   3855   3288   2519        107024
## 14      0      0      0      0      0          3053
## 15  61837  65708  65927  62963  61193        176813
## 16     93     63     NA     NA     NA            NA
## 17     NA     NA     NA     NA     NA            NA
## 18   6267   6359   7449  10852  13117         48518
## 19  72767  76671  77231  77103  76829        335858
## 20     NA     NA     NA     NA     NA            NA
## 21     NA     NA     NA     NA     NA            NA
## 22  12261  13194  12581  12074  10884         31959
## 23     NA     NA     NA     NA     NA            NA
## 24   2916   3175   3526   3844   4768         11961
## 25  15177  16369  16107  15918  15652         54869
## 26   1312    641    612    649    630         52593
## 27   1139   1252   1593   1506   1272          5460
## 28     NA     NA     NA     NA     NA          2058
## 29   1860   1875   1874   1844   1821            NA
## 30     NA     NA     NA     NA     NA         28372
## 31 450726 463475 456884 441012 432160       2608240
ttc
##                   FARE.MEDIA  X2015  X2014  X2013  X2012  X2011  X2010
## 1                      ADULT     NA     NA     NA     NA     NA     NA
## 2                     TOKENS 110945 111157 112360 117962 124748 120366
## 3                    TICKETS     NA     NA     NA     NA     NA   1298
## 4                   TWO-FARE     NA     NA     NA     NA     NA     NA
## 5                     PRESTO  13323   9862   8194   4399   1139      0
## 6       REGULAR MONTHLY PASS 204509 214932 213982 205086 194928 203101
## 7        POST-SECONDARY PASS  48396  42855  38426  35019  32091   9200
## 8               TWIN-GO PASS     NA     NA     NA     NA     NA     NA
## 9                WEEKLY PASS   8843   9361   9557  10185   9893   9237
## 10                      CASH  48873  49120  48623  46467  43795  43149
## 11                 SUB-TOTAL 434889 437287 431142 419118 406594 386351
## 12            SENIOR/STUDENT     NA     NA     NA     NA     NA     NA
## 13              MONTHLY PASS  25092  23064  20509  19769  18590  17169
## 14               WEEKLY PASS    672    515    540    624    702    814
## 15                   TICKETS  32595  33408  35472  37039  38299  38674
## 16                  TWO-FARE     NA     NA     NA     NA     NA     NA
## 17                    PRESTO    438     12     NA     NA     NA     NA
## 18                      CASH  12170  12037   8538   8164   7609   5856
## 19                 SUB-TOTAL  70967  69036  65059  65596  65200  62513
## 20                  CHILDREN     NA     NA     NA     NA     NA     NA
## 21                FREE RIDES  10939     NA     NA     NA     NA     NA
## 22                   TICKETS   1066   7097   7563   7929   8304   8287
## 23                    PRESTO     10     NA     NA     NA     NA     NA
## 24                      CASH    526   3705   2708   2589   2433   2539
## 25                 SUB-TOTAL  12541  10802  10271  10518  10737  10826
## 26           DAY/VIST./OTHER   8561  10033  11428  11929  10642  10605
## 27            BLIND/WAR AMPS   1086   1119   1109   1086   1060   1073
## 28           PREMIUM EXPRESS    490    451    401    372    344    322
## 29           POSTAL CARRIERS     NA     NA     NA     NA     NA     NA
## 30                  GTA PASS   5471   6087   5784   5388   5642   5667
## 31              SYSTEM TOTAL 534005 534815 525194 514007 500219 477357
##     X2009  X2008  X2007  X2006  X2005  X2004  X2003  X2002  X2001  X2000
## 1      NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 2  114686  94210  69134  75340  82162  80859  80330  82102  83771  82218
## 3    8807  34445  65398  68546  73151  72952  71485  74578  70930  66331
## 4      NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 5       0     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 6  208172 203313 195001 171314 140594 125836 119681 116805 118176 112081
## 7      NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 8      NA     NA     NA     NA     NA     NA     NA    762   1118   1105
## 9    8738   7517   7126   5413   1296     NA     NA     NA     NA     NA
## 10  41445  39408  36317  38684  47521  55172  51328  51052  58400  61539
## 11 381848 378893 372976 359297 344724 334819 322824 325299 332395 323274
## 12     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 13  15331  14864  14506  12931  11068   9940  10586  11123  12397  11785
## 14    874    780    686    372     93      0      0      0      0      0
## 15  38615  39097  40181  40808  42746  41562  41844  44018  44012  43885
## 16     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 17     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 18   5526   5253   4211   4581   6549   7602   6759   6440   7507   7921
## 19  60346  59994  59584  58692  60456  59104  59189  61581  63916  63591
## 20     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 21     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 22   8562   8782   8959   8879   8143   7573   7915   8869   9133   9401
## 23     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 24   2410   2253   1933   2168   3916   4514   4072   3938   3999   4203
## 25  10972  11035  10892  11047  12059  12087  11987  12807  13132  13604
## 26  10880   9961   9636   9194   7598   6191   5817   9685   4943   4837
## 27   1074   1092   1094   1025   1026   1065   1113   1275   1223   1138
## 28    313    310    295    259    260    278    280    282    339    330
## 29     NA     NA     NA     58    702    700    664    683    719    753
## 30   5800   5415   5292   4972   4395   3855   3538   3927   3326   3031
## 31 471233 466700 459769 444544 431220 418099 405412 415539 419993 410558
##     X1999  X1998  X1997  X1996  X1995  X1994  X1993  X1992  X1991  X1990
## 1      NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 2   83028  85303  86991  87857  87775  97877 104016 114064 111365 119538
## 3   64109  66490  66177  67164  70369  62700  57710  53655  35788  38369
## 4      NA     NA     NA   4329  12525  13265  12894  11201  17927  22313
## 5      NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 6  103447  98473  91521  86549  96803  96907 100607 109509 108148 116610
## 7      NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 8    1430   1649   1592   1652   1976   2111   2514   2924   3235   2758
## 9      NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 10  54835  49658  46209  32642  20930  20708  22131  23696  60034  67296
## 11 306849 301573 292490 280193 290378 293568 299872 315049 336497 366884
## 12     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 13  10124   9419   8647   9098  11028  10716  10508  10389   5850   5117
## 14      0      0      0      0      0      0      0      0      0      0
## 15  44263  46559  48306  52852  58515  56539  56625  57082  56559  61358
## 16     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 17     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 18   7481   7197   7665   8114   5202   4772   4703   3725   6030   6324
## 19  61868  63175  64618  70064  74745  72027  71836  71196  68439  72799
## 20     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 21     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 22   9186   9640   9667  10129  11338  12192  12193  11569  11722  12417
## 23     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 24   4209   4344   4313   3674   2877   2717   2486   1860   2546   3071
## 25  13395  13984  13980  13803  14215  14909  14679  13429  14268  15488
## 26   5576   5236   4611   4234   4833   4764   4228   1853   2572    972
## 27   1131   1137   1057   1057   1059   1119   1122   1127   1093   1098
## 28    301    307    310    322    425    437    417    257     NA     NA
## 29    790    783    902   1107   1122   1085   1331   1340   1298   1993
## 30   2683   2494   1915   1650   1375    343     NA     NA     NA     NA
## 31 392593 388689 379883 372430 388152 388252 393485 404251 424167 459234
##     X1989  X1988  X1987  X1986  X1985
## 1      NA     NA     NA     NA     NA
## 2  114874 122180 127088 126217 128207
## 3   37401  39514  38944  42052  48793
## 4   27025  18837   8976   7347     NA
## 5      NA     NA     NA     NA     NA
## 6  113506 119264 109151 101901  94970
## 7      NA     NA     NA     NA     NA
## 8      NA     NA     NA     NA     NA
## 9      NA     NA     NA     NA     NA
## 10  65665  66872  75308  66475  63986
## 11 358471 366667 359467 343992 335956
## 12     NA     NA     NA     NA     NA
## 13   4570   4541   3855   3288   2519
## 14      0      0      0      0      0
## 15  61837  65708  65927  62963  61193
## 16     93     63     NA     NA     NA
## 17     NA     NA     NA     NA     NA
## 18   6267   6359   7449  10852  13117
## 19  72767  76671  77231  77103  76829
## 20     NA     NA     NA     NA     NA
## 21     NA     NA     NA     NA     NA
## 22  12261  13194  12581  12074  10884
## 23     NA     NA     NA     NA     NA
## 24   2916   3175   3526   3844   4768
## 25  15177  16369  16107  15918  15652
## 26   1312    641    612    649    630
## 27   1139   1252   1593   1506   1272
## 28     NA     NA     NA     NA     NA
## 29   1860   1875   1874   1844   1821
## 30     NA     NA     NA     NA     NA
## 31 450726 463475 456884 441012 432160
#notice any errors? :)
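One point worth making explicit here (whether or not it is the intended catch): a pipe that is not assigned prints a result but changes nothing. A minimal sketch:

```r
library(dplyr)

df <- data.frame(x = 1:3)
df %>% mutate(y = x * 2)        # prints a data frame with column y, but...
"y" %in% names(df)              # FALSE: df itself is unchanged

df <- df %>% mutate(y = x * 2)  # assign the result to keep the new column
"y" %in% names(df)              # TRUE
```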

Practical skill outcomes of R stats useful for the competency test: check and address missing values; grab part of a dataset; use pipes to move and wrangle chunks of your dataset.

Lesson 3. Visualization in r (VR).

basic plots. lattice. ggplot2. you need to see your data: see the trends and explore them using visuals.

Contemporary data viz for statistical analyses slide deck

Philosophy of R stats Clean, simple graphics are powerful tools in statistics (and in scientific communication). Tufte and others have shaped how data scientists and statisticians develop libraries, standards, and assumptions for graphical representations of data. Data viz must highlight differences, show underlying data structures, and provide insights into the specific research project. R is infinitely customizable in all these respects. There are at least two major current paradigms (there are more; these are the two dominant idea sets). Base R plots are simple, relatively flexible, and very easy. However, their grammar, i.e. their rules of coding, is not modern. ggplot2 and related libraries invoke a new, formal grammar of graphics that is more logical and more flexible, but divergent from base R code. It is worth the time to understand the differences and know when to use each.

Plotting in R evolved from base R to lattice, and then to the grammar-of-graphics universe, where ggplot2 is now the dominant library (with ggvis a more recent, related project). Base R is certainly useful in some contexts, as are lattice and latticeExtra. However, ggplot2 now encompasses most of these capacities with a much simpler grammar (i.e. rules and ordering). Nonetheless, you should be able to read base R plotting code and write some as well. The philosophy, or grammar, of modern graphics is well articulated and includes the following key principles.

The grammar of graphics: primacy of layers (simple first, then more complex), i.e. you build up your plots; data are mapped to aesthetic attributes and geometric objects; data first, then statistics, even in plots.
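Those layering principles can be sketched directly in code: data and aesthetic mappings first, then geometric objects, then statistical layers on top.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg))  # data mapped to aesthetic attributes
p <- p + geom_point()                      # geometric objects drawn from the data
p <- p + geom_smooth(method = "lm")        # a statistical layer added last
p
```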

Disclaimer: I love the power of qplots.

ggplot2 cheatsheet

Data viz case study #1.

library(ggplot2)
survey<-read.csv("data/5081.survey.1.csv")
str(survey)
## 'data.frame':    18 obs. of  4 variables:
##  $ r.experience : int  2 1 2 3 1 1 3 1 1 3 ...
##  $ discipline   : Factor w/ 4 levels "ecology","environmental science",..: 4 3 4 4 1 4 1 1 1 3 ...
##  $ research.data: Factor w/ 2 levels "qualitative",..: 2 2 2 2 2 2 2 1 2 2 ...
##  $ r.studio     : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 2 1 2 1 ...
plot(survey$r.experience) #hard to tell what is going on

qplot(r.experience, data=survey) #qplot decided to make bins for me and count up
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#so, we know better and instead do a histogram using base graphics
#basic data viz for EDA
hist(survey$r.experience) #better

qplot(r.experience, data=survey, geom="histogram") #same as what is picked for us
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(r.experience, data=survey, geom="histogram", binwidth=0.5)

barplot(survey$r.experience) #confusing

qplot(r.experience, data=survey, geom="bar") #what, it is back!

#basic data viz for EDA but for interactions
plot(survey$discipline, survey$r.experience)

qplot(discipline, r.experience, data=survey) #not the same

qplot(discipline, r.experience, data=survey, geom="boxplot")

plot(survey$r.studio~survey$r.experience) #ugly

qplot(r.experience, r.studio, data=survey) #useless

qplot(r.studio, data=survey, weight = r.experience) #sweet new feature here

#ok, so you get it. grammar different, visuals about the same for super quick, simple plots. The grammar hints at the power that awaits though.

#grammar different, simple x or x~y plots about the same

Data viz case study #2.

str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
#crazy number of observations. We need fewer. Too many riches are not always good.
set.seed(1000)
dsmall<-diamonds[sample(nrow(diamonds), 100), ]
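A quick aside on set.seed(): it makes the random subsample reproducible, which matters for analyses anyone should be able to rerun. A minimal check:

```r
set.seed(1000)
a <- sample(1:10, 3)
set.seed(1000)
b <- sample(1:10, 3)
identical(a, b)  # TRUE: the same seed gives the same 'random' draw
```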

plot(dsmall$carat, dsmall$price)

qplot(carat, price, data=dsmall)

#ok no difference
#now let's see what we can do with qplot with a few bits added
#one little argument extra added
qplot(carat, price, data = dsmall, colour = color)

qplot(carat, price, data = dsmall, shape = cut)

#how about using data viz to even more thoroughly explore potential stats we could do.
#qplot - quick plot, thoughtful build of layers
qplot(carat, price, data = dsmall, geom = c("point", "smooth"))

#what about trying some stats on this now, at least from a viz philosophy
qplot(color, price / carat, data = dsmall, geom = "boxplot") #can include formulas and methods

#or check for proportional differences
qplot(carat, data = dsmall, geom = "histogram", fill = color) #to see proportions
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(carat, data = dsmall, geom = "histogram", weight = price) # weight by a covariate
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#final idea, how about subsetting with the data right in the code for the plots!
qplot(carat, data = diamonds, facets = color ~ .,
  geom = "histogram", binwidth = 0.1, xlim = c(0, 3)) #to compare between groups
## Warning: Removed 32 rows containing non-finite values (stat_bin).

#qplot is so powerful.
#colour, shape and size have meaning in the R code from this library
#layers added for you by qplots

#qplot gets you two x and one y, or one x and two y, so more than two variables at once easily

Data viz case study #3.

#ggplot() gives you even more options for adding layers/options
p <- ggplot(mtcars, aes(x = mpg, y = wt))
p + geom_point()

#now play time with this case study.
#try out some geom options and different aesthetics and make some errors.
#prize for the prettiest plots

#displ is car engine size in Liters
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

#so aesthetics are one way to add variables in and expand your plotting power
#however facets are another way to make multiple plots BY a factor

#facet wrap is by one variable
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

#facet_wrap(~cell) - univariate: create a 1-d strip of panels, based on one factor, and wrap the strip into a 2-d matrix
#facet_grid(row~col) - (usually) bivariate: create a 2-d matrix of panels, based on two factors

#facet grid is by two variables
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_grid(drv ~ cyl)

#another example, with cleaner code
p <- ggplot(data = mpg, aes(x = displ, y = hwy, color = drv)) + geom_point()
p + facet_wrap(~cyl)

#or just use facets in qplot - much simpler
qplot(displ, hwy, data=mpg, facets = . ~ year) + geom_smooth()

Data viz case study #4.

#try it with ecological.footprints.csv
footprints <-read.csv("data/ecological.footprints.csv")
str(footprints)
## 'data.frame':    191 obs. of  3 variables:
##  $ country             : Factor w/ 145 levels "Albania","Algeria",..: 137 139 71 37 92 63 6 45 25 123 ...
##  $ ecological.footprint: num  15.99 12.22 10.31 9.88 9.54 ...
##  $ yr                  : int  2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
#aha, R thinks year is an integer
footprints$yr <- as.factor(footprints$yr)
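The integer-vs-factor distinction changes how both plots and tests treat yr. A minimal sketch of the difference:

```r
yr <- c(2000, 2012, 2000, 2012)
class(yr)             # "numeric": plots would treat year as a continuous axis
yr.f <- as.factor(yr)
class(yr.f)           # "factor": plots and t.test treat year as discrete groups
levels(yr.f)          # "2000" "2012"
```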
library(ggplot2)
qplot(country, ecological.footprint, data = footprints) #too messy

qplot(country, ecological.footprint, data = footprints, colour = yr) #better but still a lot to process

qplot(country, ecological.footprint, data = footprints, facets = yr~.) #ok but do not love. hard to see distribution

qplot(ecological.footprint, data = footprints) #you know what, this is not bad. maybe add year in too.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(ecological.footprint, data = footprints, fill = yr) #ok, now I am starting to see structure and differences
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#OK, now I am ready for stats. Thinking about these data, we have only two years, and not every country appears in both, so within-country contrasts are limited. The most likely hypothesis I can test is whether ecological footprints changed between these two years. Not a perfect dataset, but we can compare the two years.

t.test(footprints$ecological.footprint~ footprints$yr)
## 
##  Welch Two Sample t-test
## 
## data:  footprints$ecological.footprint by footprints$yr
## t = 2.7386, df = 174.55, p-value = 0.00681
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.2207405 1.3597701
## sample estimates:
## mean in group 2000 mean in group 2012 
##           3.124255           2.334000
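For reference, R's t.test() defaults to the Welch version (unequal variances, fractional df as above); setting var.equal = TRUE gives the classic Student's t-test. A sketch on simulated data (made-up values, not the footprint data):

```r
set.seed(42)
g1 <- rnorm(30, mean = 3, sd = 1)
g2 <- rnorm(30, mean = 2, sd = 1)
t.test(g1, g2)                    # Welch: df estimated, usually fractional
t.test(g1, g2, var.equal = TRUE)  # Student: df = n1 + n2 - 2
```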
#ok, looks like there are differences between years, but it was hard to tell from the previous plot. I realize now I need a better plot still.

qplot(yr, ecological.footprint, data = footprints, geom="boxplot") #this is weird, 2000 looks higher

#different countries between years?
#more countries reported in 2000?
library(dplyr)
footprints %>% count(yr)
## # A tibble: 2 × 2
##       yr     n
##   <fctr> <int>
## 1   2000   141
## 2   2012    50
#Yup, way more data for year 2000
#maybe we should use just the countries that were measured in both years?

library(tidyr)
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:reshape2':
## 
##     smiths
matches <- spread(footprints, yr, ecological.footprint) %>% na.omit()
str(matches)
## 'data.frame':    46 obs. of  3 variables:
##  $ country: Factor w/ 145 levels "Albania","Algeria",..: 1 2 4 7 9 18 28 30 31 34 ...
##  $ 2000   : num  1.86 1.79 3.79 5.45 0.6 2.6 3.39 1.9 2.77 2.1 ...
##  $ 2012   : num  1.8 1.6 2.7 5.3 0.7 2.9 3.2 1.8 2.5 1.9 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:99] 3 5 6 8 10 11 12 13 14 15 ...
##   .. ..- attr(*, "names")= chr [1:99] "3" "5" "6" "8" ...
matches
##            country 2000 2012
## 1          Albania 1.86  1.8
## 2          Algeria 1.79  1.6
## 4        Argentina 3.79  2.7
## 7          Austria 5.45  5.3
## 9       Bangladesh 0.60  0.7
## 18          Brazil 2.60  2.9
## 28           Chile 3.39  3.2
## 30        Colombia 1.90  1.8
## 31      Costa Rica 2.77  2.5
## 34            Cuba 2.10  1.9
## 40         Ecuador 2.26  2.4
## 42     El Salvador 1.55  2.0
## 46          France 7.27  4.9
## 48         Germany 6.31  4.6
## 51       Guatemala 1.40  1.8
## 56        Honduras 1.43  1.7
## 59           India 1.06  0.9
## 60       Indonesia 1.48  1.1
## 62            Iraq 1.73  1.4
## 64          Israel 5.40  4.0
## 66         Jamaica 2.68  1.7
## 67           Japan 5.94  4.2
## 68          Jordan 1.71  2.1
## 72      Kyrgyzstan 1.87  1.3
## 73            Laos 0.91  1.3
## 79      Madagascar 0.93  1.2
## 84          Mexico 2.69  3.3
## 85         Moldova 2.47  2.1
## 87         Morocco 1.56  1.3
## 92     New Zealand 9.54  4.3
## 93       Nicaragua 1.26  1.6
## 97          Norway 6.13  4.8
## 99        Pakistan 1.09  0.8
## 101         Panama 2.35  3.0
## 104           Peru 1.33  2.0
## 105    Philippines 1.42  1.0
## 121      Sri Lanka 0.95  1.2
## 124    Switzerland 6.63  5.0
## 125          Syria 2.56  1.5
## 126     Tajikistan 0.90  0.9
## 128       Thailand 2.70  2.4
## 132        Tunisia 2.27  1.8
## 133         Turkey 2.73  2.6
## 138 United Kingdom 6.29  4.7
## 142      Venezuela 2.88  3.0
## 143        Vietnam 0.95  1.4
#whoa, USA and Canada missing, and we have HUGE footprints.

#got it. just the countries with measures in BOTH years.
#now, gather up again with these filtered matches!

new <- matches %>% gather(`2000`, `2012`, key = "yr", value ="ecological.footprint")
new #ok so now a nice dataframe with just matches back in a format I can use for plots and stats
##           country   yr ecological.footprint
## 1         Albania 2000                 1.86
## 2         Algeria 2000                 1.79
## 3       Argentina 2000                 3.79
## 4         Austria 2000                 5.45
## 5      Bangladesh 2000                 0.60
## 6          Brazil 2000                 2.60
## 7           Chile 2000                 3.39
## 8        Colombia 2000                 1.90
## 9      Costa Rica 2000                 2.77
## 10           Cuba 2000                 2.10
## 11        Ecuador 2000                 2.26
## 12    El Salvador 2000                 1.55
## 13         France 2000                 7.27
## 14        Germany 2000                 6.31
## 15      Guatemala 2000                 1.40
## 16       Honduras 2000                 1.43
## 17          India 2000                 1.06
## 18      Indonesia 2000                 1.48
## 19           Iraq 2000                 1.73
## 20         Israel 2000                 5.40
## 21        Jamaica 2000                 2.68
## 22          Japan 2000                 5.94
## 23         Jordan 2000                 1.71
## 24     Kyrgyzstan 2000                 1.87
## 25           Laos 2000                 0.91
## 26     Madagascar 2000                 0.93
## 27         Mexico 2000                 2.69
## 28        Moldova 2000                 2.47
## 29        Morocco 2000                 1.56
## 30    New Zealand 2000                 9.54
## 31      Nicaragua 2000                 1.26
## 32         Norway 2000                 6.13
## 33       Pakistan 2000                 1.09
## 34         Panama 2000                 2.35
## 35           Peru 2000                 1.33
## 36    Philippines 2000                 1.42
## 37      Sri Lanka 2000                 0.95
## 38    Switzerland 2000                 6.63
## 39          Syria 2000                 2.56
## 40     Tajikistan 2000                 0.90
## 41       Thailand 2000                 2.70
## 42        Tunisia 2000                 2.27
## 43         Turkey 2000                 2.73
## 44 United Kingdom 2000                 6.29
## 45      Venezuela 2000                 2.88
## 46        Vietnam 2000                 0.95
## 47        Albania 2012                 1.80
## 48        Algeria 2012                 1.60
## 49      Argentina 2012                 2.70
## 50        Austria 2012                 5.30
## 51     Bangladesh 2012                 0.70
## 52         Brazil 2012                 2.90
## 53          Chile 2012                 3.20
## 54       Colombia 2012                 1.80
## 55     Costa Rica 2012                 2.50
## 56           Cuba 2012                 1.90
## 57        Ecuador 2012                 2.40
## 58    El Salvador 2012                 2.00
## 59         France 2012                 4.90
## 60        Germany 2012                 4.60
## 61      Guatemala 2012                 1.80
## 62       Honduras 2012                 1.70
## 63          India 2012                 0.90
## 64      Indonesia 2012                 1.10
## 65           Iraq 2012                 1.40
## 66         Israel 2012                 4.00
## 67        Jamaica 2012                 1.70
## 68          Japan 2012                 4.20
## 69         Jordan 2012                 2.10
## 70     Kyrgyzstan 2012                 1.30
## 71           Laos 2012                 1.30
## 72     Madagascar 2012                 1.20
## 73         Mexico 2012                 3.30
## 74        Moldova 2012                 2.10
## 75        Morocco 2012                 1.30
## 76    New Zealand 2012                 4.30
## 77      Nicaragua 2012                 1.60
## 78         Norway 2012                 4.80
## 79       Pakistan 2012                 0.80
## 80         Panama 2012                 3.00
## 81           Peru 2012                 2.00
## 82    Philippines 2012                 1.00
## 83      Sri Lanka 2012                 1.20
## 84    Switzerland 2012                 5.00
## 85          Syria 2012                 1.50
## 86     Tajikistan 2012                 0.90
## 87       Thailand 2012                 2.40
## 88        Tunisia 2012                 1.80
## 89         Turkey 2012                 2.60
## 90 United Kingdom 2012                 4.70
## 91      Venezuela 2012                 3.00
## 92        Vietnam 2012                 1.40
qplot(yr, ecological.footprint, data = new, geom="boxplot") #replot with just the matched countries

t.test(new$ecological.footprint~ new$yr, paired=TRUE)
## 
##  Paired t-test
## 
## data:  new$ecological.footprint by new$yr
## t = 2.7553, df = 45, p-value = 0.008434
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1121662 0.7217469
## sample estimates:
## mean of the differences 
##               0.4169565
#Well, I give up; it seems like footprints for the world went down, not up, in this time period. GOOD NEWS for the environmental movement in some respects.
new %>% group_by(yr) %>% summarise(mean(ecological.footprint))
## # A tibble: 2 × 2
##      yr `mean(ecological.footprint)`
##   <chr>                        <dbl>
## 1  2000                     2.801739
## 2  2012                     2.384783
# We still use 2.4 planets worth of resources but only have one.

#AND, we are missing some key countries that contribute to global change including Canada and USA.

Practical skill outcomes of R stats useful for competency test Do a meaningful plot of a single variable with nice labels. Do a meaningful plot of a relationship between two variables. Use qplot to do a plot that includes an aesthetic. Do a ggplot that includes three variables.

Physics example in Wired Mag on why to plot data even if you have a model

Lesson 4. Exploratory data analysis (EDA) in r.

fundamental descriptive stats. distributions via density plots. GLM. GLMM. Post hoc tests. modelr

slide deck for EDA

Philosophy of R stats Exploratory data analysis is everything we have done: a. Tidy data. b. Inspect data structure. c. Data viz. d. Basic exploratory data analyses.

However, now that we are ready to apply models, we add one more tiny step. Visualize the data to better understand its shape and underlying distribution. Then, you are ready to fit your models.

A statistical model is an elegant, representative simplification of the patterns you have identified through data viz and EDA. It should capture the data/experimental structure, including the key variables, appropriate levels, and relevant covariation or contexts that mediate outcomes. It should support the data viz. It should provide an estimate of the statistical likelihood or probability of differences. Ideally, the underlying coefficients should also be mined to convey an estimate of effect sizes. A t-test, chi-square test, linear model, general linear model, or generalized linear mixed model are all examples of models that describe and summarize patterns, and each has associated assumptions about the data it embodies. Hence, the final step before model fitting is to explore distributions.

Conceptually, there are two kinds of models: those that look back and those that look forward. Think TARDIS or time machine. A model is always a snapshot taken with your time machine. It can be a grab of what happened or a future snap of what you predict. In R, there is simple code to time travel in either direction. Actually, there is no time (arrow of time), only an observer's potential perception of it. Statistical models are our observers here. These observers use probability distributions, as we described in the first week sensu statistical thinking, to calibrate what they think we observed or will observe given the evidence at hand.
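A tiny sketch of the two directions in R (the data here are simulated for illustration, not from the course datasets): fitting lm() looks back and describes the data we already have, while predict() on new x values is the forward-looking snapshot.

```r
#simulated 'past' observations (illustrative only)
set.seed(42)
past <- data.frame(x = 1:20)
past$y <- 2 * past$x + rnorm(20)

#looking back: the model describes the pattern in data we already have
m <- lm(y ~ x, data = past)
coef(m)

#looking forward: the same model predicts y for data we have not yet seen
future <- data.frame(x = 21:25)
predict(m, newdata = future)
```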

Case study #1: x,y continuous

library(ggplot2)
library(dplyr)
library(modelr)
#Data viz for pattern
qplot(x,y, data = sim1)

#Now inspect distribution of y
ggplot(sim1) + 
  geom_density(mapping = aes(y))

ggplot(sim1) + 
  geom_histogram(mapping = aes(y), binwidth=5)

shapiro.test(sim1$y) #p-value < 0.05 means significantly different from normal
## 
##  Shapiro-Wilk normality test
## 
## data:  sim1$y
## W = 0.97483, p-value = 0.6777
qqnorm(sim1$y) #qqplots are common to inspect quantiles

#not significantly different from normal. great.
#model time!

#Remember, you own the purpose! Does x predict y? Linear model straight up!

m1 <-lm(y~x, data = sim1)
summary(m1)
## 
## Call:
## lm(formula = y ~ x, data = sim1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1469 -1.5197  0.1331  1.4670  4.6516 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.2208     0.8688   4.858 4.09e-05 ***
## x             2.0515     0.1400  14.651 1.17e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.203 on 28 degrees of freedom
## Multiple R-squared:  0.8846, Adjusted R-squared:  0.8805 
## F-statistic: 214.7 on 1 and 28 DF,  p-value: 1.173e-14
#wow, these simulated data are too good. Remember, this is the backwards, time-travel hypothesis test given what we have. Not prediction of new data per se but description of observed patterns.
coef(m1) #remember 'minecraft' your model a bit to get a sense of effects.
## (Intercept)           x 
##    4.220822    2.051533

Case study #2 categorical x, continuous y

ggplot(sim2) + 
  geom_point(aes(x, y)) #categorical x

ggplot(sim2) + 
  geom_density(mapping = aes(y))

ggplot(sim2) + 
  geom_histogram(mapping = aes(y), binwidth=1) #changing binwidth really changes perception of distribution

shapiro.test(sim2$y) 
## 
##  Shapiro-Wilk normality test
## 
## data:  sim2$y
## W = 0.92313, p-value = 0.009676
qqnorm(sim2$y) #non-normal

#so, we could do an anova, but likely not a great link to the underlying probability distribution
m2 <- aov(y~x, data = sim2)
summary(m2)
##             Df Sum Sq Mean Sq F value Pr(>F)    
## x            3  335.1  111.71   92.52 <2e-16 ***
## Residuals   36   43.5    1.21                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
names(m2)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "contrasts"     "xlevels"       "call"          "terms"        
## [13] "model"
plot(m2$residuals) #residuals not bad

#library(fitdistrplus)

m3 <- glm(y~x, data = sim2, family = "quasi")
summary(m3) #better
## 
## Call:
## glm(formula = y ~ x, family = "quasi", data = sim2)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.40131  -0.43996  -0.05776   0.49066   2.63938  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.1522     0.3475   3.316  0.00209 ** 
## xb            6.9639     0.4914  14.171 2.68e-16 ***
## xc            4.9750     0.4914  10.124 4.47e-12 ***
## xd            0.7588     0.4914   1.544  0.13131    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasi family taken to be 1.207495)
## 
##     Null deviance: 378.61  on 39  degrees of freedom
## Residual deviance:  43.47  on 36  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 2
#you can also explore distributions in even more detail to ensure correct model (correct = matches/describes underlying structure AND data distribution)
anova(m3, test="Chisq") #is x significant factor?
## Analysis of Deviance Table
## 
## Model: quasi, link: identity
## 
## Response: y
## 
## Terms added sequentially (first to last)
## 
## 
##      Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                    39     378.61              
## x     3   335.14        36      43.47 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Cullen and Frey Graphs are cool
#fitdistrplus is not bad. Usually, the type of data, i.e. count, frequency, proportion, is just as effective in deciding on family type versus distribution exploration. The goal is to fit the 'best' model. Best is simplest and representative. Formal tools to contrast models sometimes help too.
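The Cullen and Frey graph mentioned above comes from fitdistrplus; a minimal sketch on simulated count data (assuming the package is installed):

```r
library(fitdistrplus)

#simulated count data (illustrative only)
set.seed(3)
counts <- rpois(200, lambda = 4)

descdist(counts, discrete = TRUE)  #Cullen and Frey graph of skewness vs kurtosis
fit <- fitdist(counts, "pois")     #fit a candidate distribution to the data
summary(fit)
```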

#note: for mixed models, you can use the lme4 package if some effects are fixed and others are random. Need to think this over. Fixed = groups or levels in factors not due to random causes; random effects = likely from random causes or latent drivers such as population/species specificity.

Case study #3: interactions cat.x, cont.x, y

str(sim3)
## Classes 'tbl_df', 'tbl' and 'data.frame':    120 obs. of  5 variables:
##  $ x1 : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ x2 : Factor w/ 4 levels "a","b","c","d": 1 1 1 2 2 2 3 3 3 4 ...
##  $ rep: int  1 2 3 1 2 3 1 2 3 1 ...
##  $ y  : Named num  -0.571 1.184 2.237 7.437 8.518 ...
##   ..- attr(*, "names")= chr  "a" "a" "a" "b" ...
##  $ sd : num  2 2 2 2 2 2 2 2 2 2 ...
ggplot(sim3) +
  geom_point(aes(x1, y, colour = x2))

#x1 is continuous
#x2 is categorical
#Q: is there an effect of the xs on y, and do the effects interact, i.e. does the influence of x1 change with the level of x2?

ggplot(sim3) + 
  geom_density(mapping = aes(y))

ggplot(sim3) + 
  geom_histogram(mapping = aes(y), binwidth=1)

shapiro.test(sim3$y) 
## 
##  Shapiro-Wilk normality test
## 
## data:  sim3$y
## W = 0.97593, p-value = 0.0299
qqnorm(sim3$y)  

#significantly different from normal but not bad. Could be contingent on the levels of x.

ggplot(sim3, aes(x1, y)) + 
  geom_boxplot(aes(colour = x2))

#so, it looks like the distribution of y relates to the factors. Likely good to go with a parametric linear model.

m4 <-lm(y~x1*x2, data = sim3) #interaction terms for all levels
summary(m4)
## 
## Call:
## lm(formula = y ~ x1 * x2, data = sim3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.87634 -0.67655  0.04837  0.69963  2.58607 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.30124    0.40400   3.221  0.00167 ** 
## x1          -0.09302    0.06511  -1.429  0.15587    
## x2b          7.06938    0.57134  12.373  < 2e-16 ***
## x2c          4.43090    0.57134   7.755 4.41e-12 ***
## x2d          0.83455    0.57134   1.461  0.14690    
## x1:x2b      -0.76029    0.09208  -8.257 3.30e-13 ***
## x1:x2c       0.06815    0.09208   0.740  0.46076    
## x1:x2d       0.27728    0.09208   3.011  0.00322 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.024 on 112 degrees of freedom
## Multiple R-squared:  0.8221, Adjusted R-squared:  0.811 
## F-statistic: 73.93 on 7 and 112 DF,  p-value: < 2.2e-16
m5 <-lm(y~x1+x2, data = sim3) #independent x1 & x2 effects on y modeled.
summary(m5)
## 
## Call:
## lm(formula = y ~ x1 + x2, data = sim3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4674 -0.8524 -0.0729  0.7886  4.3005 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.87167    0.38738   4.832 4.22e-06 ***
## x1          -0.19674    0.04871  -4.039 9.72e-05 ***
## x2b          2.88781    0.39571   7.298 4.07e-11 ***
## x2c          4.80574    0.39571  12.145  < 2e-16 ***
## x2d          2.35959    0.39571   5.963 2.79e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.533 on 115 degrees of freedom
## Multiple R-squared:  0.5911, Adjusted R-squared:  0.5768 
## F-statistic: 41.55 on 4 and 115 DF,  p-value: < 2.2e-16
plot(m5) #sometimes I plot the model to explore/mine model for its capacity to describe the patterns

anova(m5, test="Chisq") #tells you if effects are significant.
## Analysis of Variance Table
## 
## Response: y
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## x1          1  38.32  38.319  16.314 9.718e-05 ***
## x2          3 352.07 117.358  49.966 < 2.2e-16 ***
## Residuals 115 270.11   2.349                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#you can also contrast different models using anova on the fitted models.
m.b <-anova(m4,m5, test="Chisq")
m.b
## Analysis of Variance Table
## 
## Model 1: y ~ x1 * x2
## Model 2: y ~ x1 + x2
##   Res.Df    RSS Df Sum of Sq  Pr(>Chi)    
## 1    112 117.51                           
## 2    115 270.11 -3   -152.59 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Case study #4: interactions cont.x, cont.x, y

str(sim4)
## Classes 'tbl_df', 'tbl' and 'data.frame':    300 obs. of  4 variables:
##  $ x1 : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ x2 : num  -1 -1 -1 -0.778 -0.778 ...
##  $ rep: int  1 2 3 1 2 3 1 2 3 1 ...
##  $ y  : num  4.2477 1.206 0.3535 -0.0467 4.6387 ...
ggplot(sim4) +
  geom_point(aes(x1, y, colour = x2))

ggplot(sim4) + 
  geom_density(mapping = aes(y))

ggplot(sim4) + 
  geom_histogram(mapping = aes(y), binwidth=1)

shapiro.test(sim4$y) #more divergence from normality
## 
##  Shapiro-Wilk normality test
## 
## data:  sim4$y
## W = 0.98703, p-value = 0.008507
qqnorm(sim4$y)  

m6 <-glm(y~x1*x2, data = sim4)
summary(m6) #significant interaction term in model
## 
## Call:
## glm(formula = y ~ x1 * x2, data = sim4)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.9629  -1.4165  -0.1032   1.4284   4.9957  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.03546    0.11995   0.296  0.76772    
## x1           1.82167    0.18792   9.694  < 2e-16 ***
## x2          -2.78252    0.18792 -14.807  < 2e-16 ***
## x1:x2        0.95228    0.29441   3.235  0.00136 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 4.316076)
## 
##     Null deviance: 2674.6  on 299  degrees of freedom
## Residual deviance: 1277.6  on 296  degrees of freedom
## AIC: 1296
## 
## Number of Fisher Scoring iterations: 2
anova(m6, test="Chisq") #Looks solid.
## Analysis of Deviance Table
## 
## Model: gaussian, link: identity
## 
## Response: y
## 
## Terms added sequentially (first to last)
## 
## 
##       Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                    299     2674.6              
## x1     1   405.59       298     2269.0 < 2.2e-16 ***
## x2     1   946.29       297     1322.7 < 2.2e-16 ***
## x1:x2  1    45.16       296     1277.6  0.001218 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
m7 <-glm(y~x1+x2, data = sim4)
summary(m7) #missed the interaction term here and, given the distribution exploration, it is likely important.
## 
## Call:
## glm(formula = y ~ x1 + x2, data = sim4)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.5514  -1.3859  -0.1107   1.4928   4.7180  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.03546    0.12184   0.291    0.771    
## x1           1.82167    0.19089   9.543   <2e-16 ***
## x2          -2.78252    0.19089 -14.577   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 4.453581)
## 
##     Null deviance: 2674.6  on 299  degrees of freedom
## Residual deviance: 1322.7  on 297  degrees of freedom
## AIC: 1304.5
## 
## Number of Fisher Scoring iterations: 2
m.b2 <-anova(m6, m7, test="Chisq")
m.b2
## Analysis of Deviance Table
## 
## Model 1: y ~ x1 * x2
## Model 2: y ~ x1 + x2
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)   
## 1       296     1277.6                        
## 2       297     1322.7 -1  -45.155 0.001218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Case study #5: real data diamonds

#Lucy.
#See data viz week for first steps.
#real data are always more complex
ggplot(diamonds) +
  geom_point(aes(carat, price, colour = cut))

#so price is the most likely response we want to know about!
#two key factors with different x classes: cut is categorical, and carat is continuous
#look at the price distribution (likely need to do by each x)
ggplot(diamonds) +
  geom_freqpoly(aes(price)) #long tail
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#now do by the two xs to see how y varies
ggplot(diamonds, aes(carat, colour = cut)) +
         geom_freqpoly(binwidth = 0.1)

set.seed(1000)
dsmall<-diamonds[sample(nrow(diamonds), 1000), ]
shapiro.test(dsmall$price) 
## 
##  Shapiro-Wilk normality test
## 
## data:  dsmall$price
## W = 0.78845, p-value < 2.2e-16
qqnorm(dsmall$price)  

#ok, so we have a handle on the distribution.
#certainly non-normal.

#last idea!
#before we move to fitting a model, recognizing that the data look negative binomial or poisson given the class, what if there is a relationship between the xs?
#what if larger diamonds cost more and better-cut diamonds cost more, but there are not more better-cut AND large diamonds out there? Covariation is not positive.

#EDA on just that
ggplot(diamonds, aes(cut, carat)) +
  geom_boxplot()

m8 <-glm(carat~cut, data=diamonds)
summary(m8) #looks different! More poorly cut diamonds are larger...
## 
## Call:
## glm(formula = carat ~ cut, data = diamonds)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8261  -0.3764  -0.1064   0.2580   3.9639  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.859299   0.002983 288.033  < 2e-16 ***
## cut.L       -0.203597   0.007991 -25.478  < 2e-16 ***
## cut.Q        0.038498   0.007123   5.405 6.52e-08 ***
## cut.C       -0.135611   0.006199 -21.877  < 2e-16 ***
## cut^4       -0.045096   0.004998  -9.022  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.2167356)
## 
##     Null deviance: 12119  on 53939  degrees of freedom
## Residual deviance: 11690  on 53935  degrees of freedom
## AIC: 70604
## 
## Number of Fisher Scoring iterations: 2
library(lsmeans)
## Loading required package: estimability
lsmeans(m8, "cut", adjust="tukey") # to see ls means
##  cut          lsmean          SE df asymp.LCL asymp.UCL
##  Fair      1.0461366 0.011602517 NA 1.0233961 1.0688772
##  Good      0.8491847 0.006646628 NA 0.8361575 0.8622118
##  Very Good 0.8063814 0.004235413 NA 0.7980801 0.8146826
##  Premium   0.8919549 0.003964307 NA 0.8841850 0.8997248
##  Ideal     0.7028370 0.003171257 NA 0.6966214 0.7090525
## 
## Results are given on the identity (not the response) scale. 
## Confidence level used: 0.95
#ok now ready for a simple model.
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked _by_ '.GlobalEnv':
## 
##     survey
## The following object is masked from 'package:dplyr':
## 
##     select
m9 <-glm.nb(price~carat*cut, data = diamonds)
summary(m9)
## 
## Call:
## glm.nb(formula = price ~ carat * cut, data = diamonds, init.theta = 7.433464732, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -7.8981  -0.7526  -0.0835   0.5011   5.1836  
## 
## Coefficients:
##              Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)  6.290973   0.005136 1224.818  < 2e-16 ***
## carat        1.955494   0.004778  409.299  < 2e-16 ***
## cut.L       -0.363561   0.014021  -25.930  < 2e-16 ***
## cut.Q        0.300973   0.012388   24.295  < 2e-16 ***
## cut.C       -0.286111   0.010504  -27.238  < 2e-16 ***
## cut^4       -0.035146   0.008200   -4.286 1.82e-05 ***
## carat:cut.L  0.531523   0.012490   42.555  < 2e-16 ***
## carat:cut.Q -0.325109   0.011250  -28.900  < 2e-16 ***
## carat:cut.C  0.367635   0.010142   36.249  < 2e-16 ***
## carat:cut^4  0.086441   0.008432   10.251  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(7.4335) family taken to be 1)
## 
##     Null deviance: 392022  on 53939  degrees of freedom
## Residual deviance:  55084  on 53930  degrees of freedom
## AIC: 887545
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  7.4335 
##           Std. Err.:  0.0444 
## 
##  2 x log-likelihood:  -887523.2510
anova(m9, test="Chisq")
## Warning in anova.negbin(m9, test = "Chisq"): tests made without re-
## estimating 'theta'
## Analysis of Deviance Table
## 
## Model: Negative Binomial(7.4335), link: log
## 
## Response: price
## 
## Terms added sequentially (first to last)
## 
## 
##           Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                      53939     392022              
## carat      1   333900     53938      58122 < 2.2e-16 ***
## cut        4      843     53934      57280 < 2.2e-16 ***
## carat:cut  4     2195     53930      55084 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Tips: lme4 for mixed models; vegan for ordinations; lavaan for SEMs; MASS for count and negative binomial data. (1 | factor) treats a factor as random.
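The (1 | factor) syntax above can be sketched with lme4 on simulated data (the grouping variable here is invented for illustration): x is a fixed effect, and each group gets its own random intercept.

```r
library(lme4)

#simulate 5 groups that each shift the intercept (a random effect)
set.seed(1)
d <- data.frame(group = rep(letters[1:5], each = 20), x = runif(100))
d$y <- 2 * d$x + rep(rnorm(5), each = 20) + rnorm(100, sd = 0.3)

#fixed effect of x, random intercept per group
mm <- lmer(y ~ x + (1 | group), data = d)
fixef(mm)  #fixed-effect estimates
ranef(mm)  #estimated group-level intercept deviations
```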

Ensure you see the different applications of the following models: anova, lm, glm, glmm.
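A minimal side-by-side on one toy dataset (simulated here) may help: aov() is a linear model reported as an ANOVA table, glm() generalizes lm() via a family argument (and reproduces lm() when family = gaussian), and a glmm (e.g. lme4::glmer) adds random effects on top of a glm.

```r
#toy two-group dataset (illustrative only)
set.seed(7)
d <- data.frame(x = rep(c("a", "b"), each = 15))
d$y <- ifelse(d$x == "a", 0, 1) + rnorm(30)

summary(aov(y ~ x, data = d))                  #classic ANOVA table
l1 <- lm(y ~ x, data = d)                      #linear model: coefficients + R^2
g1 <- glm(y ~ x, data = d, family = gaussian)  #same fit via glm machinery

coef(l1)
coef(g1)  #identical to lm() when family = gaussian
```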

EDA from a data science perspective: predictive

Practical skill outcomes of R stats useful for competency test Workflow description complete now.

Be able to use EDA & data viz to select a model. Be able to explore distributions of datasets. Be able to fit descriptive models (super simple to simple). Predictive models too if you like, but not required. Be able to examine the efficacy of a model.

Recognize through application of a few models that the following rule is never broken in stats…

Rule: Statistics are never prescriptive.

Processes include description or prediction. Models are powerful, purposeful tools you can use to capture & communicate evidence.

Homework Revisit the survey, ttc, or footprints data and end with a model.

Additional readings

GLMM for ecologists

A practical guide to linear models

General how to choose the right test

SUPER flowchart

Interpreting R output

Lesson 5. Wrangle, visualize, and analyze.

Here is a webinar I like and also a good book chapter.

Tutorial on reading data into R

Great read on efficient data carpentry

Efficient statistics slide deck

Halloween Hackathon Costumes optional. Candy provided. Now, we practice from start to finish, including submission of the R script or Rmd you write to turnitin.com

A graduate-level dataset. Apply your new mathemagical skills from scratch. A single three-hour block. As advanced as you want to be. Fundamental principles demonstrated. At least two fundamental statistical tests demonstrated.

Practical outcomes to demonstrate Rename some variables. Address missing values. Generate a simplified table/dataframe. Visualize the data to identify patterns and relationships. Produce a publication-quality graphic. Do EDA on the data very broadly. Do a single statistical analysis to capture the main effect observed. Annotate your code.

Rubric A total of 25%, so 5 questions each worth 5 points. Likert scale of 1 to 5, with 1 being low and 5 being awesome.

  1. Can I understand what was done?
  2. Can I repeat what was done?
  3. Does the data viz effectively and thoroughly examine and show patterns/relations?
  4. Is the EDA a clear and appropriate examination of the evidence, and does it demonstrate statistical thinking?
  5. Is the final graphic and statistical test appropriate (tidy, polished, and meaningful) and suitable for publication?

Lesson 6. Competency test.

General best practices on scientific computing

deliberate practice. tested with feedback. a virtual repeat of last week with new data, but evaluated. Pizza provided this week.